-
Notifications
You must be signed in to change notification settings - Fork 328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
python: Don't decrement a reference to a borrowed object #1420
Conversation
Now I have dumb questions about whether we're now leaking memory. Sounds like it should be balanced (given the decref) but if decref was only blowing up in some specific cases, that implies that in other cases the refcount was reasonable? |
The short answer is, no. The object/variable is allocated once at module start. Any memory for it that may linger will be returned when the process exits, regardless of what the reference count may be, though we are exiting, we do try and clean up properly...
Possibly for most Python versions the reference stuff could handle an imbalanced decref. Let me poke at it with GDB and see if I can find what the reference count for this thing actually is at various points... |
Yes, apologies, this patch is just masking over the issue. In fact I've no idea why this even gets rid of the crash. Looking with gdb(1) with the added incref in both cases (failure and non-failure) the reference count on the object goes to two, then to one after the decref. For whatever reason, the fact that it still has a reference when calling However, I believe the following is the real issue (though not 100% convinced it's all (or any of) our fault...) Looking at the two important lines from the error output
The first line shows we are tripping up in the following code in 463 if (is_factory) {
464 if (nxt_slow_path(PyCallable_Check(obj) == 0)) {
465 nxt_alert(task,
466 "factory \"%s\" in module \"%s\" "
467 "can not be called to fetch callable",
468 callable, module_name);
469 goto fail;
470 } That is due to 19 wsgi_invalid_callable = None from The important thing here is that The 500 fail:
501
502 Py_XDECREF(obj);
503 Py_XDECREF(module);
504
505 return NXT_ERROR;
506 } If we comment out the That marries up with the second error line
This is also demonstrable with the following test program /*
* py-none-decref-test.c - Demonstrate decref on Py_None issue
*
* gcc $(python3-config --embed --cflags) -Wall -Wextra -O0 -g -o py-none-decref-test py-none-decref-test.c $(python3-config --embed --libs --ldflags)
*/
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
int opt;
bool decref = false;
PyObject *obj;
while ((opt = getopt(argc, argv, "d")) != -1) {
switch (opt) {
case 'd':
decref = true;
break;
default:
fprintf(stderr, "Usage: %s [-d]\n", argv[0]);
exit(EXIT_FAILURE);
}
}
Py_InitializeEx(0);
obj = Py_None;
if (decref)
Py_DECREF(obj);
Py_Finalize();
exit(EXIT_SUCCESS);
} With Python 3.12.5 $ ./py-none-decref-test
$ ./py-none-decref-test -d
$ No issues. With Python 3.11.10 $ ./py-none-decref-test
$ ./py-none-decref-test -d
Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension
Python runtime state: finalizing (tstate=0x00007f70a7e8a718)
Current thread 0x00007f70a80ddb08 (most recent call first):
<no Python frame>
Aborted (core dumped)
$ And there's our error message. (Just to be clear, the crash is happening from the call to After 34 obj = Py_None;
(gdb) p *obj
$1 = {
ob_refcnt = 3662,
ob_type = 0x7ffff7e59cc0 <_PyNone_Type>
} Notice the somewhat high reference count. However not as high as it is under 3.12 (gdb) p *obj
$1 = {
{
ob_refcnt = 4294967295,
ob_refcnt_split = {4294967295, 0}
},
ob_type = 0x7ffff7d56560 <_PyNone_Type>
} 3.12 introduced the idea of immortality for certain often used objects including With Python 3.11.10 after 36 Py_DECREF(obj); (gdb) p *obj
$2 = {
ob_refcnt = 3661,
ob_type = 0x7ffff7e59cc0 <_PyNone_Type>
} OK, it decremented the refcnt, but wasn't happy doing so... With Python 3.12.5 (gdb) p *obj
$2 = {
{
ob_refcnt = 4294967295,
ob_refcnt_split = {4294967295, 0}
},
ob_type = 0x7ffff7d56560 <_PyNone_Type>
} No effect... With Python 3.7.3 $ ./py-none-decref-test
$ ./py-none-decref-test -d
$ Before the decref (gdb) p *obj
$3 = {
ob_refcnt = 4212,
ob_type = 0x7ffff7f3a500 <_PyNone_Type>
} After... $4 = {
ob_refcnt = 4211,
ob_type = 0x7ffff7f3a500 <_PyNone_Type>
} Decremented it, but didn't cause an issue... I currently have two thoughts...
This does then beg further questions
|
Oh, and the crash isn't due to a NULL pointer dereference or anything like that, it's simply due to Python calling abort(3)... (gdb) bt
#0 __restore_sigs (set=set@entry=0x7ffef6ab8760)
at ./arch/x86_64/syscall_arch.h:40
#1 0x00007fe47cd71702 in raise (sig=sig@entry=6) at src/signal/raise.c:11
#2 0x00007fe47cd40be8 in abort () at src/exit/abort.c:11
#3 0x00007fe47c8eb1b5 in fatal_error_exit (status=-1)
at Python/pylifecycle.c:2626
#4 fatal_error (fd=fd@entry=2, header=header@entry=0,
prefix=prefix@entry=0x0, msg=msg@entry=0x0, status=status@entry=-1)
at Python/pylifecycle.c:2737
#5 0x00007fe47c8eb9ed in _Py_FatalErrorFormat (
func=0x7fe47cabcf08 <__func__.3.lto_priv.0> "none_dealloc",
format=0x7fe47ca628f0 "%s: bug likely caused by a refcount error in a C extension") at Python/pylifecycle.c:2857
... |
Entirely my fault. This patch is plain wrong.
After chatting on the #python IRC channel, it seems we should not be DECREFing without having done an INCREF, even though this Now to see the best way to deal with this... |
So in our actual case the problem is that we do 451 obj = PyDict_GetItemString(PyModule_GetDict(module), callable);
502 Py_XDECREF(obj); Which then blows up in |
I think this is the simplest solution diff --git a/src/python/nxt_python.c b/src/python/nxt_python.c
index 7bbf3d49..c7ba0a90 100644
--- a/src/python/nxt_python.c
+++ b/src/python/nxt_python.c
@@ -462,6 +462,7 @@ nxt_python_set_target(nxt_task_t *task, nxt_python_target_t *target,
"factory \"%s\" in module \"%s\" "
"can not be called to fetch callable",
callable, module_name);
+ Py_INCREF(obj); /* borrowed reference */
goto fail;
}
|
On some Python 3.11 systems, 3.11.9 & 3.11.10, we were seeing a crash triggered by Py_Finalize() in nxt_python_atexit() when running one of our pytests, namely test/test_python_factory.py::test_python_factory_invalid_callable_value 2024/09/12 15:07:29 [alert] 5452#5452 factory "wsgi_invalid_callable" in module "wsgi" can not be called to fetch callable Fatal Python error: none_dealloc: deallocating None: bug likely caused by a refcount error in a C extension Python runtime state: finalizing (tstate=0x00007f560b88a718) Current thread 0x00007f560bde7ad0 (most recent call first): <no Python frame> 2024/09/12 15:07:29 [alert] 5451#5451 app process 5452 exited on signal 6 (core dumped) This was due to obj = PyDict_GetItemString(PyModule_GetDict(module), callable); in nxt_python_set_target() which returns a *borrowed* reference, then due to the test meaning this is a `None` object we `goto fail` and call Py_DECREF(obj); which then causes `Py_Finalize()` to blow up. The simple fix is to just increment its reference count before the `goto fail`. Note: This problem only showed up under (the various versions of Python we test on); 3.11.9 & 3.11.10. It doesn't show up under; 3.6, 3.7, 3.9, 3.10, 3.12 Cc: Konstantin Pavlov <thresh@nginx.com> Closes: nginx#1413 Fixes: a9aa9e7 ("python: Support application factories") Signed-off-by: Andrew Clayton <a.clayton@nginx.com>
Re-work patch for proper fix
|
On some Python 3.11 systems, 3.11.9 & 3.11.10, we were seeing a crash
triggered by Py_Finalize() in nxt_python_atexit() when running one of
our pytests, namely
test/test_python_factory.py::test_python_factory_invalid_callable_value
This was due to
in nxt_python_set_target() which returns a borrowed reference, then
due to the test meaning this is a
None
object wegoto fail
and callwhich then causes
Py_Finalize()
to blow up.The simple fix is to just increment its reference count before the
goto fail
.Note: This problem only showed up under (the various versions of Python
we test on); 3.11.9 & 3.11.10. It doesn't show up under; 3.6, 3.7, 3.9,
3.10, 3.12